Greetings! Today we analyse dataset: Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning. You can downlodad ata here: https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression
Tasks are: 1. Give a dataset description:
-how many mice?
-how many groups?
-are these groups balanced?
-check for NA
Perform ANOVA for BDNF_N~class
Make a linear model for ERBB4_N prediction on the basis of other proteins data:
-conduct diagnostics of the model;
-is it good solution?
-make an ordination;
-plot scores;
-find ‘% explained’ for each PC;
-plot 3D plot for first 3 PCs.
pckgs <- c('tidyverse', 'ggplot2', 'FactoMineR', 'multcompView', 'vegan', 'plotly','psych','car','gridExtra','ggpubr','rstatix','RColorBrewer')
for(i in pckgs){
if(!require(i, character.only = T)){
install.packages(i, dependencies = T)
library(i)
}
}
data_mice <- readxl::read_xls('Data_Cortex_Nuclear.xls')
According to the dataset description, there are 72 mice with 15 experiments each. Working with such a complex ID(e.g. 309_1) is not so convinient,so I decided to split Mouse_ID column into 2 columns: Mouse_ID and Experiment
data_mice <- separate(data_mice,col = 1,into=c('Mouse_ID','Experiment'),sep = '_',remove=T)
We also need to change type of some variables:
data_mice$Mouse_ID <- as.factor(data_mice$Mouse_ID)
data_mice$Experiment <- as.factor(data_mice$Experiment)
data_mice$class <- as.factor(data_mice$class)
data_mice$Genotype <- as.factor(data_mice$Genotype)
data_mice$Treatment <- as.factor(data_mice$Treatment)
data_mice$Behavior <- as.factor(data_mice$Behavior)
P.S. I found this elegant solution to get rid fo manual changing data type here: https://gist.github.com/ramhiser/93fe37be439c480dc26c4bed8aab03dd
data_mice <- data_mice %>% mutate_if(is.character,as.factor)
Let’s verify mice count (we could also take a look at the structure):
mouse_count <- length(unique(data_mice$Mouse_ID))
str(data_mice)
## tibble [1,080 × 83] (S3: tbl_df/tbl/data.frame)
## $ Mouse_ID : Factor w/ 72 levels "18899","293",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Experiment : Factor w/ 15 levels "1","10","11",..: 1 8 9 10 11 12 13 14 15 2 ...
## $ DYRK1A_N : num [1:1080] 0.504 0.515 0.509 0.442 0.435 ...
## $ ITSN1_N : num [1:1080] 0.747 0.689 0.73 0.617 0.617 ...
## $ BDNF_N : num [1:1080] 0.43 0.412 0.418 0.359 0.359 ...
## $ NR1_N : num [1:1080] 2.82 2.79 2.69 2.47 2.37 ...
## $ NR2A_N : num [1:1080] 5.99 5.69 5.62 4.98 4.72 ...
## $ pAKT_N : num [1:1080] 0.219 0.212 0.209 0.223 0.213 ...
## $ pBRAF_N : num [1:1080] 0.178 0.173 0.176 0.176 0.174 ...
## $ pCAMKII_N : num [1:1080] 2.37 2.29 2.28 2.15 2.13 ...
## $ pCREB_N : num [1:1080] 0.232 0.227 0.23 0.207 0.192 ...
## $ pELK_N : num [1:1080] 1.75 1.6 1.56 1.6 1.5 ...
## $ pERK_N : num [1:1080] 0.688 0.695 0.677 0.583 0.551 ...
## $ pJNK_N : num [1:1080] 0.306 0.299 0.291 0.297 0.287 ...
## $ PKCA_N : num [1:1080] 0.403 0.386 0.381 0.377 0.364 ...
## $ pMEK_N : num [1:1080] 0.297 0.281 0.282 0.314 0.278 ...
## $ pNR1_N : num [1:1080] 1.022 0.957 1.004 0.875 0.865 ...
## $ pNR2A_N : num [1:1080] 0.606 0.588 0.602 0.52 0.508 ...
## $ pNR2B_N : num [1:1080] 1.88 1.73 1.73 1.57 1.48 ...
## $ pPKCAB_N : num [1:1080] 2.31 2.04 2.02 2.13 2.01 ...
## $ pRSK_N : num [1:1080] 0.442 0.445 0.468 0.478 0.483 ...
## $ AKT_N : num [1:1080] 0.859 0.835 0.814 0.728 0.688 ...
## $ BRAF_N : num [1:1080] 0.416 0.4 0.4 0.386 0.368 ...
## $ CAMKII_N : num [1:1080] 0.37 0.356 0.368 0.363 0.355 ...
## $ CREB_N : num [1:1080] 0.179 0.174 0.174 0.179 0.175 ...
## $ ELK_N : num [1:1080] 1.87 1.76 1.77 1.29 1.32 ...
## $ ERK_N : num [1:1080] 3.69 3.49 3.57 2.97 2.9 ...
## $ GSK3B_N : num [1:1080] 1.54 1.51 1.5 1.42 1.36 ...
## $ JNK_N : num [1:1080] 0.265 0.256 0.26 0.26 0.251 ...
## $ MEK_N : num [1:1080] 0.32 0.304 0.312 0.279 0.274 ...
## $ TRKA_N : num [1:1080] 0.814 0.781 0.785 0.734 0.703 ...
## $ RSK_N : num [1:1080] 0.166 0.157 0.161 0.162 0.155 ...
## $ APP_N : num [1:1080] 0.454 0.431 0.423 0.411 0.399 ...
## $ Bcatenin_N : num [1:1080] 3.04 2.92 2.94 2.5 2.46 ...
## $ SOD1_N : num [1:1080] 0.37 0.342 0.344 0.345 0.329 ...
## $ MTOR_N : num [1:1080] 0.459 0.424 0.425 0.429 0.409 ...
## $ P38_N : num [1:1080] 0.335 0.325 0.325 0.33 0.313 ...
## $ pMTOR_N : num [1:1080] 0.825 0.762 0.757 0.747 0.692 ...
## $ DSCR1_N : num [1:1080] 0.577 0.545 0.544 0.547 0.537 ...
## $ AMPKA_N : num [1:1080] 0.448 0.421 0.405 0.387 0.361 ...
## $ NR2B_N : num [1:1080] 0.586 0.545 0.553 0.548 0.513 ...
## $ pNUMB_N : num [1:1080] 0.395 0.368 0.364 0.367 0.352 ...
## $ RAPTOR_N : num [1:1080] 0.34 0.322 0.313 0.328 0.312 ...
## $ TIAM1_N : num [1:1080] 0.483 0.455 0.447 0.443 0.419 ...
## $ pP70S6_N : num [1:1080] 0.294 0.276 0.257 0.399 0.393 ...
## $ NUMB_N : num [1:1080] 0.182 0.182 0.184 0.162 0.16 ...
## $ P70S6_N : num [1:1080] 0.843 0.848 0.856 0.76 0.768 ...
## $ pGSK3B_N : num [1:1080] 0.193 0.195 0.201 0.184 0.186 ...
## $ pPKCG_N : num [1:1080] 1.44 1.44 1.52 1.61 1.65 ...
## $ CDK5_N : num [1:1080] 0.295 0.294 0.302 0.296 0.297 ...
## $ S6_N : num [1:1080] 0.355 0.355 0.386 0.291 0.309 ...
## $ ADARB1_N : num [1:1080] 1.34 1.31 1.28 1.2 1.21 ...
## $ AcetylH3K9_N : num [1:1080] 0.17 0.171 0.185 0.16 0.165 ...
## $ RRP1_N : num [1:1080] 0.159 0.158 0.149 0.166 0.161 ...
## $ BAX_N : num [1:1080] 0.189 0.185 0.191 0.185 0.188 ...
## $ ARC_N : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
## $ ERBB4_N : num [1:1080] 0.145 0.15 0.145 0.141 0.142 ...
## $ nNOS_N : num [1:1080] 0.177 0.178 0.176 0.164 0.168 ...
## $ Tau_N : num [1:1080] 0.125 0.134 0.133 0.123 0.137 ...
## $ GFAP_N : num [1:1080] 0.115 0.118 0.118 0.117 0.116 ...
## $ GluR3_N : num [1:1080] 0.228 0.238 0.245 0.235 0.256 ...
## $ GluR4_N : num [1:1080] 0.143 0.142 0.142 0.145 0.141 ...
## $ IL1B_N : num [1:1080] 0.431 0.457 0.51 0.431 0.481 ...
## $ P3525_N : num [1:1080] 0.248 0.258 0.255 0.251 0.252 ...
## $ pCASP9_N : num [1:1080] 1.6 1.67 1.66 1.48 1.53 ...
## $ PSD95_N : num [1:1080] 2.01 2 2.02 1.96 2.01 ...
## $ SNCA_N : num [1:1080] 0.108 0.11 0.108 0.12 0.12 ...
## $ Ubiquitin_N : num [1:1080] 1.045 1.01 0.997 0.99 0.998 ...
## $ pGSK3B_Tyr216_N: num [1:1080] 0.832 0.849 0.847 0.833 0.879 ...
## $ SHH_N : num [1:1080] 0.189 0.2 0.194 0.192 0.206 ...
## $ BAD_N : num [1:1080] 0.123 0.117 0.119 0.133 0.13 ...
## $ BCL2_N : num [1:1080] NA NA NA NA NA NA NA NA NA NA ...
## $ pS6_N : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
## $ pCFOS_N : num [1:1080] 0.108 0.104 0.106 0.111 0.111 ...
## $ SYP_N : num [1:1080] 0.427 0.442 0.436 0.392 0.434 ...
## $ H3AcK18_N : num [1:1080] 0.115 0.112 0.112 0.13 0.118 ...
## $ EGR1_N : num [1:1080] 0.132 0.135 0.133 0.147 0.14 ...
## $ H3MeK4_N : num [1:1080] 0.128 0.131 0.127 0.147 0.148 ...
## $ CaNA_N : num [1:1080] 1.68 1.74 1.93 1.7 1.84 ...
## $ Genotype : Factor w/ 2 levels "Control","Ts65Dn": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment : Factor w/ 2 levels "Memantine","Saline": 1 1 1 1 1 1 1 1 1 1 ...
## $ Behavior : Factor w/ 2 levels "C/S","S/C": 1 1 1 1 1 1 1 1 1 1 ...
## $ class : Factor w/ 8 levels "c-CS-m","c-CS-s",..: 1 1 1 1 1 1 1 1 1 1 ...
Yep, 72 mice.
There are 8 classes (last column in dataset):
c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)
c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)
c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)
c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)
t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)
t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)
t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)
t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)
Our groups are almost the same size, I think it wouldn’t be a problem in future anlysis.
How to verify it (if you wish)? Using this code you can easily confirm dataset description:
data_mice %>%
group_by(class) %>% summarise(number = n() / 15)
## # A tibble: 8 x 2
## class number
## * <fct> <dbl>
## 1 c-CS-m 10
## 2 c-CS-s 9
## 3 c-SC-m 10
## 4 c-SC-s 9
## 5 t-CS-m 9
## 6 t-CS-s 7
## 7 t-SC-m 9
## 8 t-SC-s 9
How many NA’s we have? There are 2 ways of checking:
table(is.na(data_mice))
##
## FALSE TRUE
## 88244 1396
summary(data_mice)
## Mouse_ID Experiment DYRK1A_N ITSN1_N BDNF_N
## 18899 : 15 1 : 72 Min. :0.1453 Min. :0.2454 Min. :0.1152
## 293 : 15 10 : 72 1st Qu.:0.2881 1st Qu.:0.4734 1st Qu.:0.2874
## 294 : 15 11 : 72 Median :0.3664 Median :0.5658 Median :0.3166
## 309 : 15 12 : 72 Mean :0.4258 Mean :0.6171 Mean :0.3191
## 311 : 15 13 : 72 3rd Qu.:0.4877 3rd Qu.:0.6980 3rd Qu.:0.3482
## 320 : 15 14 : 72 Max. :2.5164 Max. :2.6027 Max. :0.4972
## (Other):990 (Other):648 NA's :3 NA's :3 NA's :3
## NR1_N NR2A_N pAKT_N pBRAF_N
## Min. :1.331 Min. :1.738 Min. :0.06324 Min. :0.06404
## 1st Qu.:2.057 1st Qu.:3.156 1st Qu.:0.20575 1st Qu.:0.16459
## Median :2.297 Median :3.761 Median :0.23118 Median :0.18230
## Mean :2.297 Mean :3.844 Mean :0.23317 Mean :0.18185
## 3rd Qu.:2.528 3rd Qu.:4.440 3rd Qu.:0.25726 3rd Qu.:0.19742
## Max. :3.758 Max. :8.483 Max. :0.53905 Max. :0.31707
## NA's :3 NA's :3 NA's :3 NA's :3
## pCAMKII_N pCREB_N pELK_N pERK_N
## Min. :1.344 Min. :0.1128 Min. :0.429 Min. :0.1492
## 1st Qu.:2.480 1st Qu.:0.1908 1st Qu.:1.204 1st Qu.:0.3374
## Median :3.327 Median :0.2106 Median :1.356 Median :0.4436
## Mean :3.537 Mean :0.2126 Mean :1.429 Mean :0.5459
## 3rd Qu.:4.482 3rd Qu.:0.2346 3rd Qu.:1.561 3rd Qu.:0.6633
## Max. :7.464 Max. :0.3062 Max. :6.113 Max. :3.5667
## NA's :3 NA's :3 NA's :3 NA's :3
## pJNK_N PKCA_N pMEK_N pNR1_N
## Min. :0.05211 Min. :0.1914 Min. :0.05682 Min. :0.5002
## 1st Qu.:0.28124 1st Qu.:0.2818 1st Qu.:0.24429 1st Qu.:0.7435
## Median :0.32133 Median :0.3130 Median :0.27739 Median :0.8211
## Mean :0.31351 Mean :0.3179 Mean :0.27503 Mean :0.8258
## 3rd Qu.:0.34871 3rd Qu.:0.3523 3rd Qu.:0.30345 3rd Qu.:0.8985
## Max. :0.49343 Max. :0.4740 Max. :0.45800 Max. :1.4082
## NA's :3 NA's :3 NA's :3 NA's :3
## pNR2A_N pNR2B_N pPKCAB_N pRSK_N
## Min. :0.2813 Min. :0.3016 Min. :0.5678 Min. :0.09594
## 1st Qu.:0.5903 1st Qu.:1.3813 1st Qu.:1.1683 1st Qu.:0.40414
## Median :0.7196 Median :1.5637 Median :1.3657 Median :0.44060
## Mean :0.7269 Mean :1.5620 Mean :1.5253 Mean :0.44285
## 3rd Qu.:0.8486 3rd Qu.:1.7485 3rd Qu.:1.8859 3rd Qu.:0.48210
## Max. :1.4128 Max. :2.7240 Max. :3.0614 Max. :0.65096
## NA's :3 NA's :3 NA's :3 NA's :3
## AKT_N BRAF_N CAMKII_N CREB_N
## Min. :0.06442 Min. :0.1439 Min. :0.2130 Min. :0.1136
## 1st Qu.:0.59682 1st Qu.:0.2643 1st Qu.:0.3309 1st Qu.:0.1618
## Median :0.68247 Median :0.3267 Median :0.3603 Median :0.1796
## Mean :0.68224 Mean :0.3785 Mean :0.3634 Mean :0.1805
## 3rd Qu.:0.75969 3rd Qu.:0.4136 3rd Qu.:0.3939 3rd Qu.:0.1957
## Max. :1.18217 Max. :2.1334 Max. :0.5862 Max. :0.3196
## NA's :3 NA's :3 NA's :3 NA's :3
## ELK_N ERK_N GSK3B_N JNK_N
## Min. :0.4977 Min. :1.132 Min. :0.1511 Min. :0.0463
## 1st Qu.:0.9444 1st Qu.:1.992 1st Qu.:1.0231 1st Qu.:0.2204
## Median :1.0962 Median :2.401 Median :1.1598 Median :0.2449
## Mean :1.1734 Mean :2.474 Mean :1.1726 Mean :0.2416
## 3rd Qu.:1.3236 3rd Qu.:2.873 3rd Qu.:1.3097 3rd Qu.:0.2633
## Max. :2.8029 Max. :5.198 Max. :2.4758 Max. :0.3872
## NA's :18 NA's :3 NA's :3 NA's :3
## MEK_N TRKA_N RSK_N APP_N
## Min. :0.1472 Min. :0.1987 Min. :0.1074 Min. :0.2356
## 1st Qu.:0.2471 1st Qu.:0.6171 1st Qu.:0.1496 1st Qu.:0.3663
## Median :0.2734 Median :0.7050 Median :0.1667 Median :0.4020
## Mean :0.2728 Mean :0.6932 Mean :0.1684 Mean :0.4048
## 3rd Qu.:0.3008 3rd Qu.:0.7742 3rd Qu.:0.1845 3rd Qu.:0.4419
## Max. :0.4154 Max. :1.0016 Max. :0.3051 Max. :0.6327
## NA's :7 NA's :3 NA's :3 NA's :3
## Bcatenin_N SOD1_N MTOR_N P38_N
## Min. :1.135 Min. :0.2171 Min. :0.2011 Min. :0.2279
## 1st Qu.:1.827 1st Qu.:0.3196 1st Qu.:0.4104 1st Qu.:0.3520
## Median :2.115 Median :0.4441 Median :0.4525 Median :0.4078
## Mean :2.147 Mean :0.5426 Mean :0.4525 Mean :0.4153
## 3rd Qu.:2.424 3rd Qu.:0.6958 3rd Qu.:0.4880 3rd Qu.:0.4663
## Max. :3.681 Max. :1.8729 Max. :0.6767 Max. :0.9333
## NA's :18 NA's :3 NA's :3 NA's :3
## pMTOR_N DSCR1_N AMPKA_N NR2B_N
## Min. :0.1666 Min. :0.1553 Min. :0.2264 Min. :0.1848
## 1st Qu.:0.6835 1st Qu.:0.5309 1st Qu.:0.3266 1st Qu.:0.5149
## Median :0.7608 Median :0.5767 Median :0.3585 Median :0.5635
## Mean :0.7590 Mean :0.5852 Mean :0.3684 Mean :0.5653
## 3rd Qu.:0.8415 3rd Qu.:0.6344 3rd Qu.:0.4008 3rd Qu.:0.6145
## Max. :1.1249 Max. :0.9164 Max. :0.7008 Max. :0.9720
## NA's :3 NA's :3 NA's :3 NA's :3
## pNUMB_N RAPTOR_N TIAM1_N pP70S6_N
## Min. :0.1856 Min. :0.1948 Min. :0.2378 Min. :0.1311
## 1st Qu.:0.3128 1st Qu.:0.2761 1st Qu.:0.3720 1st Qu.:0.2811
## Median :0.3474 Median :0.3049 Median :0.4072 Median :0.3777
## Mean :0.3571 Mean :0.3158 Mean :0.4186 Mean :0.3945
## 3rd Qu.:0.3927 3rd Qu.:0.3473 3rd Qu.:0.4560 3rd Qu.:0.4811
## Max. :0.6311 Max. :0.5267 Max. :0.7221 Max. :1.1292
## NA's :3 NA's :3 NA's :3 NA's :3
## NUMB_N P70S6_N pGSK3B_N pPKCG_N
## Min. :0.1180 Min. :0.3441 Min. :0.09998 Min. :0.5988
## 1st Qu.:0.1593 1st Qu.:0.8267 1st Qu.:0.14925 1st Qu.:1.2968
## Median :0.1782 Median :0.9313 Median :0.16021 Median :1.6646
## Mean :0.1811 Mean :0.9431 Mean :0.16121 Mean :1.7066
## 3rd Qu.:0.1972 3rd Qu.:1.0451 3rd Qu.:0.17174 3rd Qu.:2.1130
## Max. :0.3166 Max. :1.6800 Max. :0.25321 Max. :3.3820
##
## CDK5_N S6_N ADARB1_N AcetylH3K9_N
## Min. :0.1812 Min. :0.1302 Min. :0.5291 Min. :0.05253
## 1st Qu.:0.2726 1st Qu.:0.3167 1st Qu.:0.9305 1st Qu.:0.10357
## Median :0.2938 Median :0.4010 Median :1.1283 Median :0.15042
## Mean :0.2924 Mean :0.4292 Mean :1.1974 Mean :0.21648
## 3rd Qu.:0.3125 3rd Qu.:0.5349 3rd Qu.:1.3802 3rd Qu.:0.26965
## Max. :0.8174 Max. :0.8226 Max. :2.5399 Max. :1.45939
##
## RRP1_N BAX_N ARC_N ERBB4_N
## Min. :-0.06201 Min. :0.07233 Min. :0.06725 Min. :0.1002
## 1st Qu.: 0.14902 1st Qu.:0.16817 1st Qu.:0.11084 1st Qu.:0.1470
## Median : 0.16210 Median :0.18074 Median :0.12163 Median :0.1564
## Mean : 0.16663 Mean :0.17931 Mean :0.12152 Mean :0.1565
## 3rd Qu.: 0.17741 3rd Qu.:0.19158 3rd Qu.:0.13196 3rd Qu.:0.1654
## Max. : 0.61238 Max. :0.24114 Max. :0.15875 Max. :0.2087
##
## nNOS_N Tau_N GFAP_N GluR3_N
## Min. :0.09973 Min. :0.09623 Min. :0.08611 Min. :0.1114
## 1st Qu.:0.16645 1st Qu.:0.16799 1st Qu.:0.11277 1st Qu.:0.1957
## Median :0.18267 Median :0.18863 Median :0.12046 Median :0.2169
## Mean :0.18130 Mean :0.21049 Mean :0.12089 Mean :0.2219
## 3rd Qu.:0.19857 3rd Qu.:0.23394 3rd Qu.:0.12772 3rd Qu.:0.2460
## Max. :0.26074 Max. :0.60277 Max. :0.21362 Max. :0.3310
##
## GluR4_N IL1B_N P3525_N pCASP9_N
## Min. :0.07258 Min. :0.2840 Min. :0.2074 Min. :0.8532
## 1st Qu.:0.10889 1st Qu.:0.4756 1st Qu.:0.2701 1st Qu.:1.3756
## Median :0.12355 Median :0.5267 Median :0.2906 Median :1.5227
## Mean :0.12656 Mean :0.5273 Mean :0.2913 Mean :1.5483
## 3rd Qu.:0.14195 3rd Qu.:0.5770 3rd Qu.:0.3116 3rd Qu.:1.7131
## Max. :0.53700 Max. :0.8897 Max. :0.4437 Max. :2.5862
##
## PSD95_N SNCA_N Ubiquitin_N pGSK3B_Tyr216_N
## Min. :1.206 Min. :0.1012 Min. :0.7507 Min. :0.5774
## 1st Qu.:2.079 1st Qu.:0.1428 1st Qu.:1.1163 1st Qu.:0.7937
## Median :2.242 Median :0.1575 Median :1.2366 Median :0.8499
## Mean :2.235 Mean :0.1598 Mean :1.2393 Mean :0.8488
## 3rd Qu.:2.420 3rd Qu.:0.1733 3rd Qu.:1.3631 3rd Qu.:0.9162
## Max. :2.878 Max. :0.2576 Max. :1.8972 Max. :1.2046
##
## SHH_N BAD_N BCL2_N pS6_N
## Min. :0.1559 Min. :0.0883 Min. :0.08066 Min. :0.06725
## 1st Qu.:0.2064 1st Qu.:0.1364 1st Qu.:0.11555 1st Qu.:0.11084
## Median :0.2240 Median :0.1523 Median :0.12947 Median :0.12163
## Mean :0.2267 Mean :0.1579 Mean :0.13476 Mean :0.12152
## 3rd Qu.:0.2417 3rd Qu.:0.1740 3rd Qu.:0.14823 3rd Qu.:0.13196
## Max. :0.3583 Max. :0.2820 Max. :0.26151 Max. :0.15875
## NA's :213 NA's :285
## pCFOS_N SYP_N H3AcK18_N EGR1_N
## Min. :0.08542 Min. :0.2586 Min. :0.07969 Min. :0.1055
## 1st Qu.:0.11351 1st Qu.:0.3981 1st Qu.:0.12585 1st Qu.:0.1551
## Median :0.12652 Median :0.4485 Median :0.15824 Median :0.1749
## Mean :0.13105 Mean :0.4461 Mean :0.16961 Mean :0.1831
## 3rd Qu.:0.14365 3rd Qu.:0.4908 3rd Qu.:0.19788 3rd Qu.:0.2045
## Max. :0.25653 Max. :0.7596 Max. :0.47976 Max. :0.3607
## NA's :75 NA's :180 NA's :210
## H3MeK4_N CaNA_N Genotype Treatment Behavior
## Min. :0.1018 Min. :0.5865 Control:570 Memantine:570 C/S:525
## 1st Qu.:0.1651 1st Qu.:1.0814 Ts65Dn :510 Saline :510 S/C:555
## Median :0.1940 Median :1.3174
## Mean :0.2054 Mean :1.3378
## 3rd Qu.:0.2352 3rd Qu.:1.5858
## Max. :0.4139 Max. :2.1298
## NA's :270
## class
## c-CS-m :150
## c-SC-m :150
## c-CS-s :135
## c-SC-s :135
## t-CS-m :135
## t-SC-m :135
## (Other):240
There are 1396 NAs. i don’t want to lose any data for any protein. It should be noted that some proteins measurements data contains extreme amount of NAs. We could simply remove them entirely, but for some proteins the amount of data will shrink drastically.We would do some other stuff: we will replace NAs with means according to class (not the whole column!This type of change (using the whole column) would be a disaster in biological sense). Replacing with 0’s will significantly change other stats.
data_mice[,c(3:79,83)] <- data_mice[,c(3:79,83)] %>% group_by(class) %>%
mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE),.)))
Let’s check if we successfully replaced NAs:
table(is.na(data_mice))
##
## FALSE
## 89640
No NAs found. Now we’re talking.
Let’s have a look at BDNF_N level of each class:
ggplot(data_mice, aes(class, BDNF_N)) + geom_boxplot(aes(fill = class)) + theme_bw() + scale_fill_brewer(palette="Accent") + ggtitle('BDNF_N level plot')
First of all, we should check if BDNF_N data is normally distributed:
shapiro.test(data_mice$BDNF_N)
##
## Shapiro-Wilk normality test
##
## data: data_mice$BDNF_N
## W = 0.99261, p-value = 3.211e-05
Far from normal distribution. But let’s try to use one-way ANOVA despite of this fact,and after then let’s have a look at QQplot of standardized resilduals. If QQplot would be good, so our ANOVA analysis could be applied. It should be noted that we must use pairwise comparison tests after ANOVA. I chose to conduct Tukey post-hoc test.
anova1 <- aov(BDNF_N~class,data_mice)
posthoc1 <- TukeyHSD(x=anova1)
posthoc1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = BDNF_N ~ class, data = data_mice)
##
## $class
## diff lwr upr p adj
## c-CS-s-c-CS-m 0.0030979096 -0.013720984 0.0199168033 0.9992916
## c-SC-m-c-CS-m -0.0482716607 -0.064641970 -0.0319013517 0.0000000
## c-SC-s-c-CS-m -0.0258249025 -0.042643796 -0.0090060088 0.0000951
## t-CS-m-c-CS-m -0.0264852092 -0.043304103 -0.0096663155 0.0000538
## t-CS-s-c-CS-m -0.0337569838 -0.051796186 -0.0157177818 0.0000005
## t-SC-m-c-CS-m -0.0181541101 -0.034973004 -0.0013352164 0.0238955
## t-SC-s-c-CS-m -0.0136310252 -0.030449919 0.0031878685 0.2134012
## c-SC-m-c-CS-s -0.0513695703 -0.068188464 -0.0345506766 0.0000000
## c-SC-s-c-CS-s -0.0289228121 -0.046178633 -0.0116669913 0.0000116
## t-CS-m-c-CS-s -0.0295831188 -0.046838940 -0.0123272979 0.0000064
## t-CS-s-c-CS-s -0.0368548934 -0.055302142 -0.0184076449 0.0000001
## t-SC-m-c-CS-s -0.0212520197 -0.038507841 -0.0039961989 0.0047693
## t-SC-s-c-CS-s -0.0167289348 -0.033984756 0.0005268861 0.0651201
## c-SC-s-c-SC-m 0.0224467582 0.005627864 0.0392656519 0.0013973
## t-CS-m-c-SC-m 0.0217864515 0.004967558 0.0386053452 0.0022588
## t-CS-s-c-SC-m 0.0145146769 -0.003524525 0.0325538789 0.2215086
## t-SC-m-c-SC-m 0.0301175506 0.013298657 0.0469364443 0.0000019
## t-SC-s-c-SC-m 0.0346406355 0.017821742 0.0514595292 0.0000000
## t-CS-m-c-SC-s -0.0006603067 -0.017916128 0.0165955142 1.0000000
## t-CS-s-c-SC-s -0.0079320813 -0.026379330 0.0105151672 0.8967905
## t-SC-m-c-SC-s 0.0076707924 -0.009585028 0.0249266133 0.8792631
## t-SC-s-c-SC-s 0.0121938774 -0.005061943 0.0294496982 0.3859175
## t-CS-s-t-CS-m -0.0072717746 -0.025719023 0.0111754738 0.9328307
## t-SC-m-t-CS-m 0.0083310991 -0.008924722 0.0255869199 0.8253159
## t-SC-s-t-CS-m 0.0128541840 -0.004401637 0.0301100049 0.3156819
## t-SC-m-t-CS-s 0.0156028737 -0.002844375 0.0340501221 0.1686252
## t-SC-s-t-CS-s 0.0201259586 0.001678710 0.0385732071 0.0213198
## t-SC-s-t-SC-m 0.0045230850 -0.012732736 0.0217789058 0.9933333
plot(posthoc1,las=1,cex.axis=0.5,col="brown")
Difference between classes in BDNF_N level are significant.
How to interpret this plot (in naive but right way)? If bar is not in contact with ‘0.0’ line, it means that difference is big enough to be significant.
Let’s check if our results are satisfied to the conditions of applicability of ANOVA.
There are 4 conditions:
-normal distribution of residuals;
-homogenity of residuals variance;
-lack of the collinearity;
-independent measurements in each group/class.
anova_diag <- fortify(anova1)
gg_resid1 <- ggplot(data = anova_diag, aes(x = .fitted, y = .stdresid)) +
geom_point() +
geom_smooth(method = 'lm') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = 2, color = "red") +
geom_hline(yintercept = -2, color = "red")+
xlab('Predicted') + ylab ('Standardized residual') + theme_bw() + ggtitle('Standardized residuals distribution plot')
gg_resid1
## `geom_smooth()` using formula 'y ~ x'
We found no pattern of distribution of residuals.
More fancy plot to display residuals distribution:
ggplot(anova_diag, aes(class, .stdresid, fill = class)) + geom_boxplot() +
xlab('Class') + ylab ('Standardized residuals') + ggtitle ('Standardized residuals distribution plot') +
theme_bw() + scale_fill_brewer(palette="Dark2")
qqPlot(anova_diag$.stdresid)
## [1] 182 918
The distribution is almost normal (except of the upper part of the curve). I think ANOVA can handle this, our results can be trusted.
For all fastidious statisticians striving for perfection, I also made a Kruskal-Wallis test with Dunn test for pairwise comparisons.
kruskal_anova <-data_mice %>% kruskal_test(BDNF_N~class)
pwc_nonpar <-data_mice %>% dunn_test(BDNF_N~class, p.adjust.method = "hochberg")
pwc_nonpar
## # A tibble: 28 x 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 BDNF_N c-CS-m c-CS-s 150 135 0.145 8.85e- 1 8.85e- 1 ns
## 2 BDNF_N c-CS-m c-SC-m 150 150 -8.65 4.97e-18 1.39e-16 ****
## 3 BDNF_N c-CS-m c-SC-s 150 135 -4.21 2.59e- 5 4.66e- 4 ***
## 4 BDNF_N c-CS-m t-CS-m 150 135 -4.53 6.00e- 6 1.26e- 4 ***
## 5 BDNF_N c-CS-m t-CS-s 150 105 -6.02 1.79e- 9 4.31e- 8 ****
## 6 BDNF_N c-CS-m t-SC-m 150 135 -3.03 2.41e- 3 3.13e- 2 *
## 7 BDNF_N c-CS-m t-SC-s 150 135 -2.29 2.21e- 2 2.43e- 1 ns
## 8 BDNF_N c-CS-s c-SC-m 135 150 -8.57 1.05e-17 2.84e-16 ****
## 9 BDNF_N c-CS-s c-SC-s 135 135 -4.24 2.22e- 5 4.44e- 4 ***
## 10 BDNF_N c-CS-s t-CS-m 135 135 -4.55 5.29e- 6 1.16e- 4 ***
## # … with 18 more rows
pwc_nonpar <- pwc_nonpar %>% add_xy_position(x = "class")
ggboxplot(data_mice, x = "class", y = "BDNF_N") +
stat_pvalue_manual(pwc_nonpar, hide.ns = TRUE) +
labs(
subtitle = get_test_label(kruskal_anova, detailed = TRUE),
caption = get_pwc_label(pwc_nonpar)
)
Results are the same.
The first step would be a building of full model which contains all of the predictors:
fit1 <- lm(ERBB4_N~.-Mouse_ID-Experiment-Genotype-Treatment-class-Behavior,data=data_mice)
summary(fit1)
##
## Call:
## lm(formula = ERBB4_N ~ . - Mouse_ID - Experiment - Genotype -
## Treatment - class - Behavior, data = data_mice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.025516 -0.003834 -0.000075 0.003593 0.038233
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0274958 0.0057420 4.789 1.93e-06 ***
## DYRK1A_N -0.0077591 0.0054504 -1.424 0.154877
## ITSN1_N 0.0121514 0.0059337 2.048 0.040833 *
## BDNF_N -0.0018158 0.0142494 -0.127 0.898624
## NR1_N -0.0066853 0.0038133 -1.753 0.079882 .
## NR2A_N 0.0002810 0.0009857 0.285 0.775635
## pAKT_N 0.0290597 0.0144957 2.005 0.045262 *
## pBRAF_N -0.0488960 0.0222090 -2.202 0.027919 *
## pCAMKII_N -0.0001033 0.0004580 -0.225 0.821664
## pCREB_N -0.0430923 0.0201837 -2.135 0.033002 *
## pELK_N -0.0007096 0.0009696 -0.732 0.464427
## pERK_N -0.0023965 0.0026422 -0.907 0.364612
## pJNK_N -0.0380089 0.0167366 -2.271 0.023358 *
## PKCA_N 0.0414818 0.0156256 2.655 0.008063 **
## pMEK_N -0.0075635 0.0162765 -0.465 0.642256
## pNR1_N -0.0201548 0.0089399 -2.254 0.024380 *
## pNR2A_N 0.0053618 0.0044320 1.210 0.226642
## pNR2B_N 0.0033623 0.0038226 0.880 0.379295
## pPKCAB_N 0.0019429 0.0019169 1.014 0.311055
## pRSK_N 0.0155977 0.0088832 1.756 0.079416 .
## AKT_N 0.0118477 0.0060490 1.959 0.050435 .
## BRAF_N -0.0104543 0.0067836 -1.541 0.123604
## CAMKII_N 0.0243916 0.0133265 1.830 0.067501 .
## CREB_N -0.0282832 0.0201232 -1.406 0.160182
## ELK_N 0.0022419 0.0026323 0.852 0.394574
## ERK_N 0.0041749 0.0016293 2.562 0.010543 *
## GSK3B_N 0.0025504 0.0040383 0.632 0.527820
## JNK_N 0.0037357 0.0209840 0.178 0.858738
## MEK_N 0.0063418 0.0140818 0.450 0.652554
## TRKA_N 0.0015545 0.0086639 0.179 0.857645
## RSK_N 0.0373716 0.0250948 1.489 0.136744
## APP_N -0.0001430 0.0083580 -0.017 0.986350
## Bcatenin_N 0.0010572 0.0028745 0.368 0.713102
## SOD1_N -0.0034438 0.0020719 -1.662 0.096798 .
## MTOR_N 0.0379793 0.0119760 3.171 0.001564 **
## P38_N 0.0004134 0.0087135 0.047 0.962165
## pMTOR_N -0.0141851 0.0057399 -2.471 0.013627 *
## DSCR1_N 0.0014968 0.0061796 0.242 0.808664
## AMPKA_N -0.0219883 0.0153626 -1.431 0.152661
## NR2B_N 0.0035785 0.0067541 0.530 0.596352
## pNUMB_N 0.0088084 0.0128920 0.683 0.494610
## RAPTOR_N 0.0134691 0.0166112 0.811 0.417646
## TIAM1_N -0.0177624 0.0121355 -1.464 0.143596
## pP70S6_N 0.0020408 0.0038899 0.525 0.599957
## NUMB_N -0.0428467 0.0237500 -1.804 0.071520 .
## P70S6_N -0.0068535 0.0035596 -1.925 0.054470 .
## pGSK3B_N 0.0680863 0.0259059 2.628 0.008714 **
## pPKCG_N -0.0072380 0.0013190 -5.488 5.16e-08 ***
## CDK5_N 0.0097937 0.0094400 1.037 0.299768
## S6_N 0.0002482 0.0043802 0.057 0.954832
## ADARB1_N -0.0003681 0.0013816 -0.266 0.789971
## AcetylH3K9_N 0.0045392 0.0041963 1.082 0.279642
## RRP1_N -0.0208306 0.0092596 -2.250 0.024690 *
## BAX_N -0.0039991 0.0238661 -0.168 0.866960
## ARC_N 0.1436985 0.0398005 3.610 0.000321 ***
## nNOS_N 0.0138962 0.0189010 0.735 0.462383
## Tau_N 0.0510379 0.0112052 4.555 5.89e-06 ***
## GFAP_N -0.0535168 0.0311494 -1.718 0.086092 .
## GluR3_N -0.0206878 0.0125084 -1.654 0.098457 .
## GluR4_N -0.0218156 0.0111955 -1.949 0.051621 .
## IL1B_N 0.0202090 0.0073967 2.732 0.006403 **
## P3525_N 0.0880711 0.0160245 5.496 4.92e-08 ***
## pCASP9_N 0.0126000 0.0022052 5.714 1.46e-08 ***
## PSD95_N 0.0175607 0.0021005 8.360 < 2e-16 ***
## SNCA_N 0.0199547 0.0240340 0.830 0.406583
## Ubiquitin_N -0.0007648 0.0033327 -0.229 0.818540
## pGSK3B_Tyr216_N 0.0227687 0.0065117 3.497 0.000492 ***
## SHH_N 0.0085335 0.0128293 0.665 0.506103
## BAD_N -0.0463235 0.0159592 -2.903 0.003782 **
## BCL2_N 0.0122268 0.0149683 0.817 0.414210
## pS6_N NA NA NA NA
## pCFOS_N -0.0490420 0.0166173 -2.951 0.003238 **
## SYP_N 0.0215151 0.0065234 3.298 0.001007 **
## H3AcK18_N 0.0089403 0.0070752 1.264 0.206663
## EGR1_N 0.0105255 0.0111806 0.941 0.346722
## H3MeK4_N 0.0203880 0.0094645 2.154 0.031465 *
## CaNA_N -0.0065095 0.0024444 -2.663 0.007869 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.006661 on 1004 degrees of freedom
## Multiple R-squared: 0.8182, Adjusted R-squared: 0.8046
## F-statistic: 60.25 on 75 and 1004 DF, p-value: < 2.2e-16
Hm, 1 NA suddenly found. It’s a pS6_N, and NA means that this variable is linearly related to the other variables, thus causing perfect collinearity. This fact can also be found when making VIF estimation:
vif(fit1)
## Error in vif.default(fit1): there are aliased coefficients in the model
perf_coll_vars <- attributes(alias(fit1)$Complete)$dimnames[[1]]
We have to remove pS6_N from our model.
fit2 <- update(fit1, .~. - pS6_N)
Time to diagnostics!
It;s time to do everyday’s job.
fit2_diag <- fortify(fit2)
head(fit2_diag)
## ERBB4_N DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N
## 1 0.1449893 0.5036439 0.7471932 0.4301753 2.816329 5.990152 0.2188300 0.1775655
## 2 0.1504709 0.5146171 0.6890635 0.4117703 2.789514 5.685038 0.2116362 0.1728170
## 3 0.1453302 0.5091831 0.7302468 0.4183088 2.687201 5.622059 0.2090109 0.1757222
## 4 0.1406558 0.4421067 0.6170762 0.3586263 2.466947 4.979503 0.2228858 0.1764626
## 5 0.1419830 0.4349402 0.6174298 0.3588022 2.365785 4.718679 0.2131059 0.1736270
## 6 0.1395645 0.4475064 0.6281758 0.3673881 2.385939 4.807635 0.2185778 0.1762334
## pCAMKII_N pCREB_N pELK_N pERK_N pJNK_N PKCA_N pMEK_N
## 1 2.373744 0.2322238 1.750936 0.6879062 0.3063817 0.4026984 0.2969273
## 2 2.292150 0.2269721 1.596377 0.6950062 0.2990511 0.3859868 0.2813189
## 3 2.283337 0.2302468 1.561316 0.6773484 0.2912761 0.3810025 0.2817103
## 4 2.152301 0.2070042 1.595086 0.5832768 0.2967287 0.3770870 0.3138320
## 5 2.134014 0.1921579 1.504230 0.5509601 0.2869612 0.3635021 0.2779643
## 6 2.141282 0.1951875 1.442398 0.5663396 0.2898239 0.3638930 0.2668369
## pNR1_N pNR2A_N pNR2B_N pPKCAB_N pRSK_N AKT_N BRAF_N CAMKII_N
## 1 1.0220603 0.6056726 1.877684 2.308745 0.4415994 0.8593658 0.4162891 0.3696080
## 2 0.9566759 0.5875587 1.725774 2.043037 0.4452219 0.8346593 0.4003642 0.3561775
## 3 1.0036350 0.6024488 1.731873 2.017984 0.4676679 0.8143294 0.3998469 0.3680888
## 4 0.8753903 0.5202932 1.566852 2.132754 0.4776707 0.7277046 0.3856387 0.3629700
## 5 0.8649120 0.5079898 1.480059 2.013697 0.4834161 0.6877937 0.3675305 0.3553109
## 6 0.8591209 0.5213066 1.538244 1.968275 0.4959000 0.6724022 0.3694045 0.3571717
## CREB_N ELK_N ERK_N GSK3B_N JNK_N MEK_N TRKA_N RSK_N
## 1 0.1789443 1.866358 3.685247 1.537227 0.2645263 0.3196770 0.8138665 0.1658460
## 2 0.1736797 1.761047 3.485287 1.509249 0.2557270 0.3044187 0.7805042 0.1571935
## 3 0.1739047 1.765544 3.571456 1.501244 0.2596135 0.3117467 0.7851540 0.1608954
## 4 0.1794489 1.286277 2.970137 1.419710 0.2595358 0.2792181 0.7344917 0.1622099
## 5 0.1748355 1.324695 2.896334 1.359876 0.2507050 0.2736672 0.7026991 0.1548274
## 6 0.1797285 1.227450 2.956983 1.447910 0.2508402 0.2840436 0.7043958 0.1568759
## APP_N Bcatenin_N SOD1_N MTOR_N P38_N pMTOR_N DSCR1_N
## 1 0.4539098 3.037621 0.3695096 0.4585385 0.3353358 0.8251920 0.5769155
## 2 0.4309403 2.921882 0.3422793 0.4235599 0.3248347 0.7617176 0.5450973
## 3 0.4231873 2.944136 0.3436962 0.4250048 0.3248517 0.7570308 0.5436197
## 4 0.4106149 2.500204 0.3445093 0.4292113 0.3301208 0.7469798 0.5467626
## 5 0.3985498 2.456560 0.3291258 0.4087552 0.3134148 0.6919565 0.5368605
## 6 0.3910472 2.467133 0.3275978 0.4044899 0.2962764 0.6744186 0.5397231
## AMPKA_N NR2B_N pNUMB_N RAPTOR_N TIAM1_N pP70S6_N NUMB_N
## 1 0.4480993 0.5862714 0.3947213 0.3395706 0.4828639 0.2941698 0.1821505
## 2 0.4208761 0.5450973 0.3682546 0.3219592 0.4545193 0.2764306 0.1820863
## 3 0.4046298 0.5529941 0.3638799 0.3130859 0.4471972 0.2566482 0.1843877
## 4 0.3868603 0.5478485 0.3667707 0.3284919 0.4426497 0.3985340 0.1617677
## 5 0.3608164 0.5128240 0.3515510 0.3122063 0.4190949 0.3934470 0.1602002
## 6 0.3542143 0.5143164 0.3472241 0.3031321 0.4128243 0.3825783 0.1623303
## P70S6_N pGSK3B_N pPKCG_N CDK5_N S6_N ADARB1_N AcetylH3K9_N
## 1 0.8427252 0.1926084 1.443091 0.2947000 0.3546045 1.339070 0.1701188
## 2 0.8476146 0.1948153 1.439460 0.2940598 0.3545483 1.306323 0.1714271
## 3 0.8561658 0.2007373 1.524364 0.3018807 0.3860868 1.279600 0.1854563
## 4 0.7602335 0.1841694 1.612382 0.2963818 0.2906795 1.198765 0.1597991
## 5 0.7681129 0.1857183 1.645807 0.2968294 0.3093450 1.206995 0.1646503
## 6 0.7796946 0.1867930 1.634615 0.2880373 0.3323671 1.123445 0.1756929
## RRP1_N BAX_N ARC_N nNOS_N Tau_N GFAP_N GluR3_N
## 1 0.1591024 0.1888517 0.1063052 0.1766677 0.1251904 0.1152909 0.2280435
## 2 0.1581289 0.1845700 0.1065922 0.1783090 0.1342751 0.1182345 0.2380731
## 3 0.1486963 0.1905322 0.1083031 0.1762129 0.1325604 0.1177602 0.2448173
## 4 0.1661123 0.1853235 0.1031838 0.1638042 0.1232096 0.1174394 0.2349467
## 5 0.1606870 0.1882214 0.1047838 0.1677096 0.1368377 0.1160478 0.2555277
## 6 0.1505939 0.1838235 0.1064762 0.1748445 0.1305147 0.1152432 0.2368495
## GluR4_N IL1B_N P3525_N pCASP9_N PSD95_N SNCA_N Ubiquitin_N
## 1 0.1427556 0.4309575 0.2475378 1.603310 2.014875 0.1082343 1.0449792
## 2 0.1420366 0.4571562 0.2576322 1.671738 2.004605 0.1097485 1.0098831
## 3 0.1424450 0.5104723 0.2553430 1.663550 2.016831 0.1081962 0.9968476
## 4 0.1450682 0.4309959 0.2511031 1.484624 1.957233 0.1198832 0.9902247
## 5 0.1408705 0.4812265 0.2517730 1.534835 2.009109 0.1195244 0.9977750
## 6 0.1364536 0.4785775 0.2444853 1.507777 2.003535 0.1206872 0.9201782
## pGSK3B_Tyr216_N SHH_N BAD_N BCL2_N pCFOS_N SYP_N H3AcK18_N
## 1 0.8315565 0.1888517 0.1226520 0.1325395 0.1083359 0.4270992 0.1147832
## 2 0.8492704 0.2004036 0.1166822 0.1325395 0.1043154 0.4415813 0.1119735
## 3 0.8467087 0.1936845 0.1185082 0.1325395 0.1062193 0.4357769 0.1118829
## 4 0.8332768 0.1921119 0.1327812 0.1325395 0.1112620 0.3916910 0.1304053
## 5 0.8786678 0.2056042 0.1299541 0.1325395 0.1106939 0.4341538 0.1184814
## 6 0.8436793 0.1904695 0.1315752 0.1325395 0.1094457 0.4398331 0.1166572
## EGR1_N H3MeK4_N CaNA_N .hat .sigma .cooksd .fitted
## 1 0.1317900 0.1281856 1.675652 0.06057898 0.006664179 1.189185e-04 0.1474064
## 2 0.1351030 0.1311187 1.743610 0.05444041 0.006664331 7.143134e-05 0.1484818
## 3 0.1333618 0.1274311 1.926427 0.06371890 0.006663435 3.262995e-04 0.1492211
## 4 0.1474442 0.1469011 1.700563 0.04211131 0.006664598 8.129227e-06 0.1414286
## 5 0.1403143 0.1483799 1.839730 0.04532238 0.006663116 2.875387e-04 0.1463989
## 6 0.1407664 0.1421804 1.816389 0.04620239 0.006661507 6.023966e-04 0.1458891
## .resid .stdresid
## 1 -0.0024170821 -0.3743697
## 2 0.0019890237 0.3070683
## 3 -0.0038908789 -0.6036483
## 4 -0.0007728722 -0.1185467
## 5 -0.0044158661 -0.6784642
## 6 -0.0063245944 -0.9721740
ggplot(fit2_diag, aes(x = 1:nrow(fit2_diag), y = .cooksd)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = 2, color = "red") + theme_bw() + ggtitle('Cooks distance plot')
No one is ever close to Cooks distance borderline == 2. It means we don’t have any influential observation.
gg_resid2 <-ggplot(data = fit2_diag, aes(x = .fitted, y = .stdresid)) +
geom_point() +
geom_smooth(method = 'lm') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = 2, color = "red") +
geom_hline(yintercept = -2, color = "red")+
xlab('Predicted') + ylab ('Standardized residual') + theme_bw() + ggtitle('Standardized residuals distribution plot')
gg_resid2
## `geom_smooth()` using formula 'y ~ x'
No patterns found (but we have some outliers).
ggplot(data_mice, aes(x = ERBB4_N, y = 1:nrow(data_mice))) +
geom_point() + theme_bw() +
labs(y = 'Observation number', x = 'Predicted variable values') + ggtitle('Autocorrelations plot')
There are autocorrelations in our data since there were multiple measurements for each mouse.
Normality check:
ggplot(data = fit2_diag, aes(x = 1:nrow(fit2_diag), y = .stdresid)) +
geom_point() + theme_bw() +
labs(y = 'Standardized Residual', x = 'Observation number') + ggtitle('Normality check plot')
And QQ plots:
qqPlot(fit2_diag$.stdresid)
## [1] 142 794
qqPlot(fit2_diag$.fitted)
## [1] 89 88
Finally,multicollinearity checking:
vif(fit2)
## DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N
## 44.807821 54.073873 12.007796 42.532262 20.517338
## pAKT_N pBRAF_N pCAMKII_N pCREB_N pELK_N
## 8.835000 8.746842 8.538556 10.495956 4.972901
## pERK_N pJNK_N PKCA_N pMEK_N pNR1_N
## 20.202296 18.351865 16.156089 13.694438 26.973124
## pNR2A_N pNR2B_N pPKCAB_N pRSK_N AKT_N
## 16.840731 25.987996 20.682571 8.513558 14.409174
## BRAF_N CAMKII_N CREB_N ELK_N ERK_N
## 52.275059 11.803356 6.829365 18.713024 27.485327
## GSK3B_N JNK_N MEK_N TRKA_N RSK_N
## 23.692339 12.271350 8.086172 26.572626 12.090341
## APP_N Bcatenin_N SOD1_N MTOR_N P38_N
## 6.344710 37.550805 8.203507 14.916410 14.678849
## pMTOR_N DSCR1_N AMPKA_N NR2B_N pNUMB_N
## 11.979259 9.371754 22.415257 8.605554 16.008886
## RAPTOR_N TIAM1_N pP70S6_N NUMB_N P70S6_N
## 19.631134 16.178653 8.985047 11.772174 9.204233
## pGSK3B_N pPKCG_N CDK5_N S6_N ADARB1_N
## 6.083727 14.154825 3.027777 8.812942 6.074277
## AcetylH3K9_N RRP1_N BAX_N ARC_N nNOS_N
## 14.703563 2.121039 4.909060 7.850880 5.394238
## Tau_N GFAP_N GluR3_N GluR4_N IL1B_N
## 14.539923 4.131776 4.630194 2.202969 8.958019
## P3525_N pCASP9_N PSD95_N SNCA_N Ubiquitin_N
## 5.625320 7.280353 6.942966 8.191988 8.137325
## pGSK3B_Tyr216_N SHH_N BAD_N BCL2_N pCFOS_N
## 9.171033 3.363377 4.430391 3.191940 3.575784
## SYP_N H3AcK18_N EGR1_N H3MeK4_N CaNA_N
## 4.566674 3.857973 4.264088 5.458854 14.612223
Unfortunately, some of conditions of LM usage have not met. Our model suffers from autocorrelation, abnormal distribution, dependecy of observations and ton of multicollinearity. One might say ‘ok boomer our model explains ~80% of variance best model af’, but using this model would be a suicide for any serious researcher.
Let’s do this.
pca_res <- rda(data_mice[,3:79],scale = T)
Screeplot:
screeplot(pca_res, type = "lines", bstick = TRUE, main = 'Screeplot with Broken Stick model')
Biplot with symmetrical scaling:
biplot(pca_res)
Biplot of ordinations:
biplot(pca_res,scaling = "sites", display = "sites", main = 'Ordinations biplot')
Biplot of correlations:
biplot(pca_res,scaling = "species", display = "species", main = 'Correlations biplot')
It’s barely readable because of its clustering. We aren’t able to say where are observations of different clusters.
Let’s try to colour ordinations using data from original dataset. We need to add first 3 PCs to origianl data and using class data to colour ordination plot. Let’s build 2D plot first.
data_with_PCs <- data.frame(data_mice,
scores(pca_res, display = "sites", choices = c(1, 2, 3), scaling = "sites"))
ggplot(data_with_PCs, aes(x = PC1, y = PC2)) +
geom_point(aes(color = class), alpha = 0.5) +
ggtitle(label = 'Ordinations biplot (coloured)') + theme_bw() + scale_fill_brewer(palette="Spectral")
And 3D plot using plotly:
fig_3D_3D <- plot_ly(data_with_PCs, x = ~PC1, y = ~PC2, z = ~PC3, marker = list(size = 2), color = ~class, colors = 'BrBG')
fig_3D <- fig_3D_3D %>% add_markers()
fig_3D <- fig_3D %>% layout(scene = list(xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')))
fig_3D
Plotly is a wondeful library that allows to plot interactive plots. Okay, plots are made, and we can clearly see overlapping of classes. But how much variance can be explained by each PC? Especially with first 3 PCs we used to plot. We already plotted screeplot, but knowing the cumulative proportion would be handy.
pca_res_PC_only <- as.data.frame(summary(pca_res)$cont)
pca_res_PC_only[3, 3]
## [1] 0.5303548
So,first 3 PCs explain ~53% of variance. Let’s build more sophisticated plot dispalying impact of each PC.
pca_summary <- summary(pca_res)
pca_result <- as.data.frame(pca_summary$cont)
plot_data <- as.data.frame(t(pca_result[c("Proportion Explained"),]))
plot_data$component <- seq(1:nrow(plot_data))
ggplot(plot_data, aes( component, `Proportion Explained`)) + geom_bar(stat = "identity") + theme_bw() + xlab('PC') + ggtitle('Screeplot without Broken Stick')
And that’s all! Thank you for your attention!